Abstract

The problem of minimizing a multivariate function recurs in many
disciplines, such as Physics, Mathematics, Engineering and, of course, Computer
Science. Both deterministic and nondeterministic algorithms have been devised
to perform this task. It is common practice to use nondeterministic
algorithms when the function to be minimized is not smooth or depends on
binary variables, as in the case of combinatorial optimization. In this paper
we describe a simple nondeterministic algorithm which is based on the idea of
adaptive noise, and which proved to be particularly effective in the minimization
of a class of multivariate, continuous-valued, smooth functions associated with
some recent extensions of regularization theory by Poggio and Girosi (1990).
Results obtained by using this method and a more traditional gradient descent
technique are also compared.
1 Introduction


In many areas of science, problems of interest, both practical and
theoretical, consist in determining a configuration of a set of parameters that is
“optimal” with respect to some cost function. Therefore, a variety of
minimization algorithms have been developed for finding the global minimum of
a multivariate function. In this short note we describe an algorithm that we
have successfully used to solve a class of minimization problems that arise in
the study of the problem of learning from examples.

The cost functions that we consider, an explicit form of which will be
presented in the next section, are defined in high-dimensional spaces and
usually show many local minima. Standard descent techniques, such as gradient
descent or conjugate gradient (Polak, 1971; Acton, 1970; Press et al., 1987),
are of limited usefulness in this case, since they cannot escape local minima.
Moreover, although the analytical expression of the gradient is available,
its computation can be very time consuming, and it would be preferable to
avoid it. Therefore, it may be better to use nondeterministic techniques of
the Metropolis type (Metropolis et al., 1953), which usually do not require the
computation of derivatives and are better suited to deal with the problem of
local minima. A successful nondeterministic method is simulated annealing
(Kirkpatrick et al., 1983), which however requires the difficult choice of an
annealing schedule and has a high computational cost. The algorithm that
we developed is nondeterministic, does not require any annealing and does
not have a high computational cost. Although we are not guaranteed to find
the global minimum, experimental results show that, for the class of functions
we are considering, good local minima are usually found.

The algorithm we implemented is similar in spirit to many heuristic
minimization algorithms, like A* (Hart et al., 1968; Shapiro, 1987), the algorithm
of Kernighan and Lin (1970, 1973), and the algorithms described by Pearl
(1984). Our algorithm is also similar in spirit to some “evolutionary”
optimization algorithms, like the ones described, for example, by Wang (1987) and
Fogel (1988). It is interesting to notice that most of these algorithms have
been developed and used to solve combinatorial optimization problems, like
the Traveling Salesman Problem (Lin and Kernighan, 1973), and have rarely been
used to minimize smooth functions. We tested our algorithm in a variety of cases,
all belonging to the class of minimization problems which will be described
in the next section, and we compared the results with the ones given by a
standard technique of gradient descent with adaptive step. In all cases
considered it turns out that the nondeterministic algorithm finds the best local
minima. Preliminary experiments suggest that this could hold true not only
for the particular class of functions we are considering, but also for a wider
class of problems.

2 Regularization Networks and Minimization Problems

Recently, Poggio and Girosi (1989, 1990) described how standard mathematical
techniques can be used to approach the problem of learning. In fact,
learning an input-output mapping from a set of examples can be regarded as
synthesizing an approximation of a multi-dimensional function, that is, solving
the problem of surface reconstruction from sparse data. Poggio and Girosi
used a variational approach, based on regularization theory (Tikhonov and
Arsenin, 1977; Morozov, 1984; Bertero, 1986), and showed that this problem
is equivalent to finding the optimal “weights” of a particular type of network
with one layer of hidden units, called a regularization network. Therefore the
problem of learning becomes equivalent to the following minimization problem:

Minimization problem for regularization networks: Let H[f] be the
functional

    H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \| P_y f \|^2 ,    (1)

where $\{(x_i, y_i) \in R^d \times R\}_{i=1}^{N}$ is a given sparse data set, $P_y$ is a linear operator
that is radially symmetric in the variable y = Wx, W is a d × d matrix, $\| \cdot \|$
is a norm on the function space which $P_y f$ belongs to, and $\lambda$ is a positive
real number. Having defined the function

    f^*(x) = \sum_{\alpha=1}^{n} c_\alpha G(\| x - t_\alpha \|_W^2) ,    (2)

find $\{c_\alpha\}_{\alpha=1}^{n}$, $\{t_\alpha\}_{\alpha=1}^{n}$ and W such that $H[f^*]$ is minimum.
Here G is the Green’s function of $\hat{P} P$, $\hat{P}$ indicating the adjoint of the
operator P, and the function $f^*$ that minimizes the functional H is the
solution of the approximation problem. Equations (1) and (2) define the
class of cost functions to which our algorithm has been applied.
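
As an illustration of how a function of the form of eq. (2) and the data term of eq. (1) might be evaluated, here is a minimal sketch. Several choices in it are assumptions, not statements from the text: G is taken to be a Gaussian, the weighted norm is read as ||z||_W^2 = ||Wz||^2, and the regularization term of eq. (1) is left out because it depends on the specific operator. The names `weighted_sq_norm`, `f_star` and `data_term` are hypothetical.

```python
import math

def weighted_sq_norm(z, W):
    """||z||_W^2, taken here to mean ||W z||^2 (an assumption)."""
    return sum(sum(w * zj for w, zj in zip(row, z)) ** 2 for row in W)

def f_star(x, c, t, W, G=lambda r2: math.exp(-r2)):
    """f*(x) = sum_a c_a G(||x - t_a||_W^2); the Gaussian G is an assumed default."""
    total = 0.0
    for c_a, t_a in zip(c, t):
        diff = [xj - tj for xj, tj in zip(x, t_a)]
        total += c_a * G(weighted_sq_norm(diff, W))
    return total

def data_term(X, y, c, t, W):
    """First term of H[f*] in eq. (1): the sum of squared residuals."""
    return sum((y_i - f_star(x_i, c, t, W)) ** 2 for x_i, y_i in zip(X, y))
```

Seen this way, the cost is an ordinary real function of the coefficients, the centers and the entries of W, which is the setting the algorithm of the next section works in.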


3 The Algorithm 

This section is dedicated to a rather detailed presentation of the minimization 
algorithm that we propose. We will describe, first, the elemental step of the 
procedure, i.e. what can be considered its core, and later we will outline 
how loops of elemental steps are arranged to form the outer structure of the 
algorithm. 

In what follows, the outcome of the random extraction of a real number
within the interval [a, b] will be indicated with the notation random(a, b),
and the set of the natural numbers greater than 0 and smaller than n + 1
will be indicated with the symbol $I_n$.

Let g = g(x) be a multivariate real function defined in some domain,
A, of $R^n$. Let x be an internal point of A, and let $\mathcal{P} = \{P_1, \ldots, P_k\}$ be a
partition of $I_n$. Let us set, for each element $P_i \in \mathcal{P}$, a positive number $\omega_i$,
and let $\Omega$ be the set of all such numbers. The set $\Omega$, for reasons that will be
clear in a moment, is called the noise list.

An elemental step consists of a series of k perturbations of the current
point, each of them obtained by adding random noise only to elements that
belong to a certain subset of its components. In particular, the j-th
perturbation of any series will concern elements belonging to $P_j$ only. In this sense,
partition $\mathcal{P}$ realizes a grouping of variables that, for some reason, may be
considered “similar”. Any perturbation is then accepted - and the current
point consequently updated - if, and only if, the value taken by the function
g at the perturbed point is smaller than the value of g at the current point.
The amplitude of the noise relative to the group of components of the
current point that the perturbation has modified is doubled when the
perturbation is accepted, and halved otherwise. More
formally:
- $g_c$ is set equal to g(x);

- for each element $P_i$ of partition $\mathcal{P}$,

    - an n-dimensional vector, v, is generated according to the following rule:

          v_j = random(-\omega_i, \omega_i)   if j \in P_i
          v_j = 0                             otherwise

      where $v_j$ indicates the j-th component of v;

    - $g_a$ is set equal to g(x + v);

    - if $g_a < g_c$, then

        - $g_c$ is set equal to $g_a$;

        - x is set equal to x + v;

        - $\omega_i$ is set equal to $2\omega_i$;

      else

        - $\omega_i$ is set equal to $\omega_i / 2$;
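
The elemental step listed above can be sketched in a few lines of Python. This is only an illustrative reading: the function name `elemental_step`, the list-based data layout and the use of 0-based indices (the text counts from 1) are assumptions, and random(a, b) is rendered with the standard library's `random.uniform`.

```python
import random

def elemental_step(g, x, partition, noise):
    """One elemental step: perturb each group of coordinates in turn.

    x is a list of floats, partition a list of index lists, and noise the
    noise list (one positive amplitude per group). x and noise are updated
    in place; the value of g at the final current point is returned.
    """
    g_c = g(x)
    for i, group in enumerate(partition):
        # Perturb only the coordinates in group i, uniformly in [-w_i, w_i].
        trial = list(x)
        for j in group:
            trial[j] += random.uniform(-noise[i], noise[i])
        g_a = g(trial)
        if g_a < g_c:
            g_c = g_a
            x[:] = trial
            noise[i] *= 2.0   # accepted: double the amplitude
        else:
            noise[i] /= 2.0   # rejected: halve the amplitude
    return g_c
```

Note that each group costs exactly one function evaluation, and that no derivative of g is ever computed.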


Performing a single step of the algorithm may result either in the update 
of the noise list or in the update of the noise list and the current point. The 
whole algorithm consists, in essence, in performing a series of such steps, 
keeping track, meanwhile, of the evolution of the current point as well as the 
evolution of the noise list. 

Since the elements of the noise list dictate the maximum amplitude of the
perturbations that will occur in the following step, the noise list undergoes, during
the minimization process, a fairly intricate dynamics. However, as the
current point approaches a point of minimum for the function, our expectation
is that the elements of the noise list become smaller and smaller, although
their rates of convergence to zero may differ considerably from each other.

Given these considerations, it seems sensible to stop the minimization
process once the elements of the noise list have reached values that can be
considered small, and no further relevant evolution is expected to take place.

How to evaluate whether the values contained in the noise list are small
enough is a matter of choice, and we propose the following criterion:
elements of the noise list are small enough if

    \omega_i < \eta , \quad \forall i \in I_k ,

where $\eta$ is a given threshold.
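
The stopping test is a one-line check over the noise list. The sketch below also counts how many consecutive rejections are needed to push a single amplitude below the threshold, since every rejection halves it. Both function names are hypothetical, and the numbers in the usage note are purely illustrative.

```python
def noise_is_small(noise, eta):
    """Stopping test: every amplitude in the noise list is below eta."""
    return all(w < eta for w in noise)

def halvings_needed(w, eta):
    """Number of consecutive halvings that bring amplitude w below eta."""
    m = 0
    while w >= eta:
        w /= 2.0
        m += 1
    return m
```

For instance, starting from an amplitude of 1 with a threshold of 1e-6, `halvings_needed(1.0, 1e-6)` gives 20, because 2^-19 is still above the threshold and 2^-20 is below it. Since accepted perturbations double the amplitude again, the actual number of steps before stopping is generally larger.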

Once the process has stopped, and the final current point is returned,
there is no way, in general, to ascertain how close this point is to the actual
point of global minimum for the function. Hence, the only possibility that
we are given consists in restarting the procedure using a different initial
point. However, if we believe that the point we are looking for is not very far
from the point that we have reached, we may want to restart the procedure
near the result we have just reached, both in terms of initial point and of
noise list. A practical way to perform this task is to generate the new noise
list by multiplying the elements of the old one by a given constant. The new
starting point can then be generated by perturbing the old one according
to the new noise list.

Results obtained by restarting and performing the minimization procedure
several times are then collected, and the point where the function has
reached its best value is considered the outcome of the whole algorithm.
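
Putting the pieces together, the outer structure (repeat elemental steps until the noise list is small, then restart nearby and keep the best point) might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the names `minimize` and `minimize_with_restarts`, the `max_steps` safety guard, the number of restarts and the value of the multiplying constant `scale` are all hypothetical.

```python
import random

def minimize(g, x, partition, noise, eta, max_steps=10_000):
    """Repeat elemental steps until every noise amplitude is below eta."""
    x, noise = list(x), list(noise)
    g_c = g(x)
    for _ in range(max_steps):  # safety guard, not part of the description
        if all(w < eta for w in noise):
            break
        for i, group in enumerate(partition):
            trial = list(x)
            for j in group:
                trial[j] += random.uniform(-noise[i], noise[i])
            g_a = g(trial)
            if g_a < g_c:
                x, g_c, noise[i] = trial, g_a, 2.0 * noise[i]
            else:
                noise[i] /= 2.0
    return x, g_c, noise

def minimize_with_restarts(g, x0, partition, noise0, eta, restarts=5, scale=1e3):
    """Restart near the previous result and keep the best point found."""
    x, best_g, noise = minimize(g, x0, partition, noise0, eta)
    best_x = x
    for _ in range(restarts):
        # New noise list: the old one multiplied by a given constant.
        noise = [scale * w for w in noise]
        # New starting point: the old one perturbed according to the new noise list.
        start = list(x)
        for i, group in enumerate(partition):
            for j in group:
                start[j] += random.uniform(-noise[i], noise[i])
        x, g_x, noise = minimize(g, start, partition, noise, eta)
        if g_x < best_g:
            best_x, best_g = x, g_x
    return best_x, best_g
```

A call such as `minimize_with_restarts(g, x0, partition, [1.0] * len(partition), 1e-6)` returns the best point over all runs together with its function value.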


4 Experimental Results 

Our algorithm has been tested on the problem of finding the function,
belonging to the class expressed by eq. (2), which minimizes the functional of
eq. (1). As Poggio and Girosi showed (1989, 1990), the solution to this
minimization problem best approximates the data $y_i$, i = 1, ..., N. Each experiment
consisted, then, of the following steps:

- a data set containing N elements was generated by sampling at random
a given multivariate function h = h(x);

- the number n of eq. (2) was set, and the number of parameters that
function $f^*$ depended on was consequently fixed;

- a minimum for the function $H[f^*]$ - considered as a function of $\{c_\alpha\}_{\alpha=1}^{n}$,
$\{t_\alpha\}_{\alpha=1}^{n}$ and W - was sought;

- the mean square difference between the function h and its approximation
was computed over a randomly generated test set.
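
The first and last steps of this protocol can be sketched as follows. The sampling domain is not stated in the text, so the unit hypercube used here is an assumption, as are the function names `make_data` and `mean_square_error`.

```python
import random

def make_data(h, n_points, d, rng):
    """Sample n_points inputs uniformly in [0, 1]^d (assumed domain) and evaluate h."""
    X = [[rng.uniform(0.0, 1.0) for _ in range(d)] for _ in range(n_points)]
    y = [h(x) for x in X]
    return X, y

def mean_square_error(h, f, X_test):
    """Mean square difference between the target h and an approximation f."""
    return sum((h(x) - f(x)) ** 2 for x in X_test) / len(X_test)
```

With `rng = random.Random(0)`, the training set would come from `make_data(h, N, d, rng)`, and the error of a fitted approximation `f` would be `mean_square_error(h, f, X_test)` on a second, independently drawn sample.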
In order to evaluate the performance of our algorithm, we decided to
compare the results obtained by using it with the ones obtained by using the
fairly classical method of gradient descent with adaptive step. One hundred
iterations of gradient descent and two thousand iterations of our method
both guaranteed a good degree of convergence and required the same computing
time.

Table (1) collects results showing the average behavior of the algorithms, that
is, the average of the mean square error over several experiments.

Table (2) reports the best performance of the algorithms, that is, the lowest
mean square approximation error obtained in each group of experiments:
in this case the nondeterministic algorithm performs better than
gradient descent, since it has more chances to attain the global minimum.


Number of variables     19      34      49

Gradient Descent        0.11    0.11    0.09
Adaptive Noise          0.18    0.13    0.10


Table 1: Errors affecting surface reconstruction obtained by using the
method that we propose and the gradient descent with adaptive step method
are compared. Each column shows the average of the mean square
approximation errors coming from a set of 10 experiments. The function to be
reconstructed was $h(x, y) = e^{-(0.8x + 1.2y)} \cos 5x \, \sin 3y$ and the mean square
approximation error was computed over a set of 10,000 points. Columns 1, 2, 3
refer to groups of experiments performed by setting the number n of eq. (2) to 5,
10, and 15 respectively. Therefore, the functions to be minimized depended
on 19, 34, and 49 variables. For each experiment, 100 iterations of the gradient
descent with adaptive step method and 2000 iterations of our method were
performed.
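
The variable counts quoted in the caption are consistent with the following reading, which is an inference and not stated in the text: the inputs are two-dimensional (d = 2, since h depends on x and y), each of the n centers contributes d coordinates, each term contributes one coefficient, and all d × d entries of W are free parameters.

```latex
\[
\underbrace{n}_{c_\alpha} + \underbrace{nd}_{t_\alpha} + \underbrace{d^2}_{W}
  = 3n + 4 \quad (d = 2),
\qquad
n = 5,\ 10,\ 15 \;\Longrightarrow\; 19,\ 34,\ 49 .
\]
```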


Number of variables     19       34       49

Gradient Descent        0.045    0.057    0.044
Adaptive Noise          0.015    0.027    0.038


Table 2: As Table (1), except that each column shows the lowest
mean square approximation error over a group of 10 experiments.
The experience we have collected over hundreds of runs of
both algorithms leads us to believe that the behavior exhibited in the reported
experiments is typical of each of them.


5 Remarks 

The first aspect of our algorithm to be noticed is that it does not require any
computation of the value of the gradient. This is crucial, of course, whenever an
explicit expression of the gradient is either unavailable or computationally
expensive. For example, for the functions we experimented with, computing
the value of the gradient was 10 to 20 times more
time consuming than computing the value of the function.

Another aspect of the method is that, unlike many minimization methods,
it does not rest on any particular assumptions about the function to be
minimized and the structure of its minima. By giving credit to those
perturbations that yield a decrease of the value of the function, we perform an
estimate of the gradient that, though very rough, may capture the essential
trend of the function. In the situations of interest to us, this line of action may
be recommended, since what we want to avoid is spending time computing
a quantity - the gradient - that may end up telling little about the location of
the points of minimum for the function.

Let us finally remark that the introduction of the partition allows us
to group the variables according to their typical magnitudes, which, due to
the different roles the variables play, may differ considerably.
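
For the cost functions of Section 2, one natural grouping, offered here only as an illustration, would put the coefficients, the center coordinates and the entries of W into three separate groups, each with its own noise amplitude. The flat parameter layout and the name `network_partition` are assumptions; the text does not say which partition was used in the experiments.

```python
def network_partition(n, d):
    """Index groups for a flat parameter vector laid out as [c | t | W].

    The vector holds n coefficients, n*d center coordinates and d*d matrix
    entries, in that order (an assumed layout). Indices are 0-based.
    """
    c_idx = list(range(0, n))
    t_idx = list(range(n, n + n * d))
    w_idx = list(range(n + n * d, n + n * d + d * d))
    return [c_idx, t_idx, w_idx]
```

Each group then adapts its own amplitude, so variables of very different scale do not have to share a single step size.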